AITopics | protein sequence



Steering Generative Models with Experimental Data for Protein Fitness Optimization

Neural Information Processing Systems, Jun-22-2026, 21:58:30 GMT

Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured through low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.
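The abstract above compares adaptive sequence selection to Thompson sampling. As a minimal sketch of that selection principle only (not the paper's guidance method), the code below picks a batch by repeatedly drawing one fitness model from a set of posterior samples and taking the remaining candidate that the drawn model ranks highest. The function name `thompson_select`, the use of plain callables as posterior samples, and the toy scoring functions are all assumptions made for illustration.

```python
import random


def thompson_select(candidates, posterior_samples, batch_size, rng):
    """Pick a batch of sequences in a Thompson-sampling style.

    candidates: list of sequence strings.
    posterior_samples: list of callables mapping a sequence to a predicted
        fitness (hypothetical stand-ins for ensemble members or posterior draws).
    rng: a random.Random instance, passed in so runs can be reproduced.
    """
    remaining = list(candidates)
    chosen = []
    for _ in range(min(batch_size, len(remaining))):
        # Draw one plausible fitness function, then act greedily under it.
        model = rng.choice(posterior_samples)
        best = max(remaining, key=model)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy posterior samples: each one "believes" a different residue drives fitness.
def count_alanine(seq):
    return seq.count("A")


def count_lysine(seq):
    return seq.count("K")


batch = thompson_select(
    ["AAA", "KKK", "GGG"], [count_alanine, count_lysine], 2, random.Random(0)
)
```

Because each pick is greedy under a randomly drawn model, the batch spreads across candidates that different plausible models favor, which is the exploration behavior Thompson sampling is known for.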

large language model, machine learning, natural language, (17 more...)


Country:

North America > United States (1.00)
Europe (0.93)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)


Venus-MAXWELL: Efficient Learning of Protein-Mutation Stability Landscapes using Protein Language Models

Neural Information Processing Systems, Jun-22-2026, 17:57:29 GMT

In-silico prediction of protein mutant stability, measured by the difference in Gibbs free energy change (ΔΔG), is fundamental for protein engineering. Current sequence-to-label methods typically employ a two-stage pipeline: (i) encoding mutant sequences using neural networks (e.g., transformers), followed by (ii) ΔΔG regression from the latent representations. Although these methods have demonstrated promising performance, their dependence on specialized neural network encoders significantly increases the complexity. Additionally, the requirement to individually compute latent representations for each mutant site negatively impacts computational efficiency and poses the risk of overfitting. This work proposes the Venus-MAXWELL framework, which reformulates mutation ΔΔG prediction as a sequence-to-landscape task. In Venus-MAXWELL, mutations of a protein and their corresponding ΔΔG values are organized into a landscape matrix, allowing our framework to learn the ΔΔG landscape of a protein with a single forward and backward pass during training. In addition, to facilitate future work, we also curated a large-scale ΔΔG dataset with strict controls on data leakage and redundancy to ensure robust evaluation. Venus-MAXWELL is compatible with multiple protein language models and enables these models to predict ΔΔG accurately and efficiently. For example, when integrated with ESM-IF, Venus-MAXWELL achieves higher accuracy than ThermoMPNN with 10× faster inference (despite having 50× more parameters than ThermoMPNN).
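As a rough illustration of the landscape-matrix idea in the abstract above, the sketch below arranges single-point mutations of one protein into a grid of sequence positions by amino acids, with a mask marking which cells carry a measured label, so that a predicted grid can be scored against all labels for that protein at once. The function names, the mutation string format (such as "K2G"), the masking scheme, and the toy values are assumptions for illustration and are not taken from the Venus-MAXWELL paper.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
# Hypothetical mutation notation: wild-type residue, 1-based position, mutant residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def build_landscape(wild_type, ddg_by_mutation):
    """Return (landscape, mask) as two len(wild_type) x 20 nested lists."""
    length = len(wild_type)
    landscape = [[0.0] * len(AMINO_ACIDS) for _ in range(length)]
    mask = [[False] * len(AMINO_ACIDS) for _ in range(length)]
    for mutation, ddg in ddg_by_mutation.items():
        match = MUTATION_RE.match(mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation string: {mutation!r}")
        wt_aa, pos_str, mut_aa = match.groups()
        pos = int(pos_str) - 1  # convert to a 0-based row index
        if not 0 <= pos < length or wild_type[pos] != wt_aa:
            raise ValueError(f"{mutation!r} does not match the wild-type sequence")
        if mut_aa not in AA_INDEX:
            raise ValueError(f"unknown amino acid in {mutation!r}")
        landscape[pos][AA_INDEX[mut_aa]] = ddg
        mask[pos][AA_INDEX[mut_aa]] = True
    return landscape, mask


def masked_mse(predicted, landscape, mask):
    """Mean squared error over the labeled cells only."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(predicted, landscape, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


landscape, mask = build_landscape("MKV", {"M1A": 0.5, "K2G": -1.0})
zeros = [[0.0] * 20 for _ in range(3)]
loss = masked_mse(zeros, landscape, mask)
```

The mask matters because most cells of the grid have no measurement; scoring only labeled cells keeps unlabeled mutations from being treated as having a ΔΔG of zero.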

artificial intelligence, machine learning, natural language, (21 more...)


Country: Asia > China (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Evolutionary Reasoning Does Not Arise in Standard Usage of Protein Language Models

Neural Information Processing Systems, Jun-22-2026, 12:13:08 GMT

Protein language models (PLMs) are often assumed to capture evolutionary information by training on large protein sequence datasets. Yet it remains unclear whether PLMs can reason about evolution, that is, infer evolutionary relationships between sequences. We test this capability by evaluating whether standard PLM usage (frozen or fine-tuned embeddings with distance-based comparison) supports evolutionary reasoning. Existing PLMs consistently fail to recover phylogenetic structure, despite strong performance on sequence-level tasks such as masked-token and contact prediction. We present PHYLA, a hybrid state-space and transformer model that jointly processes multiple sequences and is trained using a tree-based objective across 3,000 phylogenies spanning diverse protein families.

bioinformatics, large language model, machine learning, (19 more...)


Country: North America > United States (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)

Technology:

Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)


Protein Function Prediction with Contrastive Alignment

Neural Information Processing Systems, Jun-15-2026, 17:41:27 GMT

Predicting protein function from sequence is a central challenge in computational biology. While existing methods rely heavily on structured ontologies or similarity-based techniques, they often lack the flexibility to express structure-free functional descriptions and novel biological functions. In this work, we introduce Prot2TextV2, a novel multimodal sequence-to-text model that generates free-form natural language descriptions of protein function directly from amino acid sequences. Our method combines a protein language model as a sequence encoder (ESM-3B) and a decoder-only language model (LLaMA-3.1-8B-Instruct)

bioinformatics, large language model, machine learning, (20 more...)


Country:

Asia (0.28)
Europe (0.28)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


[Title not recoverable: pLDDT and PAE figure-axis residue]

Neural Information Processing Systems, Jun-15-2026, 10:55:51 GMT

Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet these models struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce PRODVA, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, PRODVA achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.62%.

large language model, machine learning, natural language, (20 more...)


Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)


SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding

Neural Information Processing Systems, Jun-15-2026, 05:26:47 GMT

Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high-throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet, in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k-mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k-mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24-32% speedup over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.
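The k-mer guidance step described above can be pictured with a small sketch: collect k-mers from the ungapped sequences of a multiple sequence alignment, score each draft candidate by the fraction of its k-mers that appear in that collection, and keep the best-scoring candidate. This is a simplified stand-in, not SpecMER's actual scoring rule, and it omits the target-model verification step of speculative decoding. The function names, the choice of k = 3, and the toy sequences are assumptions.

```python
from collections import Counter


def kmer_counts(msa_sequences, k=3):
    """Count k-mers across alignment rows after removing gap characters."""
    counts = Counter()
    for seq in msa_sequences:
        ungapped = seq.replace("-", "").upper()
        for i in range(len(ungapped) - k + 1):
            counts[ungapped[i:i + k]] += 1
    return counts


def kmer_score(candidate, counts, k=3):
    """Fraction of the candidate's k-mers that were seen in the alignment."""
    n_windows = len(candidate) - k + 1
    if n_windows <= 0:
        return 0.0
    hits = sum(1 for i in range(n_windows) if candidate[i:i + k] in counts)
    return hits / n_windows


def pick_best(candidates, counts, k=3):
    """Return the draft candidate most consistent with the alignment k-mers."""
    return max(candidates, key=lambda c: kmer_score(c, counts, k))


counts = kmer_counts(["MKV-LA", "MKVQLA"])
best = pick_best(["MKWWW", "MKVLA"], counts)
```

Each candidate is scored independently of the others, so in a real system this step could run in parallel across drafts, which is the property the abstract points to.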

bioinformatics, large language model, machine learning, (21 more...)


Country: North America > United States (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Biomedical Informatics (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Understanding protein function with a multimodal retrieval-augmented foundation model

Neural Information Processing Systems, Jun-12-2026, 02:28:19 GMT

Protein language models (PLMs) learn probability distributions over natural protein sequences. By learning from hundreds of millions of natural protein sequences, protein understanding and design capabilities emerge. Recent works have shown that scaling these models improves structure prediction, but does not seem to improve mutation understanding and representation quality for protein function prediction. We introduce PoET-2, a multimodal, retrieval-augmented protein foundation model that incorporates in-context learning of family-specific evolutionary constraints with optional structure conditioning to learn generative distributions over protein sequences. PoET-2 uses a hierarchical transformer encoder that is equivariant to sequence context ordering and a dual decoder architecture with both causal and masked language modeling objectives, allowing PoET-2 to operate in both fully generative and bidirectional representation learning modes. PoET-2 achieves state-of-the-art performance on zero-shot variant effect prediction, excelling at scoring variants with multiple mutations and challenging indel mutations. In supervised settings, PoET-2 embeddings outperform previous methods for learning sequence-function relationships, especially with small datasets. This work highlights the benefits of combining retrieval augmentation with multimodal, family-centric modeling for advancing protein foundation models.

artificial intelligence, natural language, proceedings, (8 more...)


Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.97)


Graph Denoising Diffusion for Inverse Protein Folding

Neural Information Processing Systems, Apr-25-2026, 19:14:48 GMT

Inverse protein folding is challenging due to its inherent one-to-many mapping characteristic, where numerous possible amino acid sequences can fold into a single, identical protein backbone. This task involves not only identifying viable sequences but also representing the sheer diversity of potential solutions. However, existing discriminative models, such as transformer-based auto-regressive models, struggle to encapsulate the diverse range of plausible solutions. In contrast, diffusion probabilistic models, as an emerging genre of generative approaches, offer the potential to generate a diverse set of sequence candidates for determined protein backbones. We propose a novel graph denoising diffusion model for inverse protein folding, where a given protein backbone guides the diffusion process on the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the nodes' physicochemical properties and local environment. Moreover, we utilize amino acid replacement matrices for the diffusion forward process, encoding the biologically meaningful prior knowledge of amino acids from their spatial and sequential neighbors as well as themselves, which reduces the sampling space of the generative process. Our model achieves state-of-the-art performance over a set of popular baseline methods in sequence recovery and exhibits great potential in generating diverse protein sequences for a determined protein backbone structure.

artificial intelligence, machine learning, natural language, (21 more...)


Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Education > Health & Safety > School Nutrition (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


materials

Neural Information Processing Systems, Apr-24-2026, 20:49:33 GMT

A.1 Access instructions: OpenProteinSet is hosted by the Registry of Open Data on AWS (RODA) and can be accessed at the following link: registry.opendata.aws/openfold/.
A.2 Documentation and intended uses: We include a datasheet [1] in Section B. Detailed documentation on the precise structure and content of the dataset is provided on the dataset's landing page.
A.3 Data format: All OpenProteinSet files are in standard plaintext formats (A3M for MSAs, HHSearch format for template hits, and PDB for structure files) that can be read by a wide variety of bioinformatics software.
A.5 License: OpenProteinSet is made available under the CC BY 4.0 license. A copy of the license is provided with the dataset.
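Since the entry above names A3M as the MSA format, here is a minimal sketch of reading A3M text with the Python standard library. A3M is FASTA-like, and lowercase letters mark insertions relative to the query, so dropping them yields rows aligned to the query columns. The function names and the toy alignment are assumptions, and skipping lines that start with "#" is an assumption about comment lines rather than something stated in the source.

```python
def parse_a3m(text):
    """Parse A3M text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no sequence data.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data found before the first header")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def strip_insertions(sequence):
    """Drop lowercase insertion states (and '.') to recover query-aligned columns."""
    return "".join(ch for ch in sequence if not ch.islower() and ch != ".")


# Toy alignment: the second row has one insertion ('a') and one deletion ('-').
records = parse_a3m(">query\nMKVLA\n>hit1\nMKaV-A\n")
aligned = [strip_insertions(seq) for _, seq in records]
```

After stripping insertions, every row has the same length as the query, which is the form most downstream tools expect for column-wise statistics.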

artificial intelligence, bioinformatics, dataset, (15 more...)


Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.72)
Law (0.47)

Technology:

Information Technology > Artificial Intelligence (0.70)
Information Technology > Biomedical Informatics (0.50)
